
feat: vLLM backend #2010

Draft · wants to merge 93 commits into base: dev

Conversation

@gau-nernst (Contributor) commented on Feb 21, 2025

Describe Your Changes

High-level design

  • vLLM is an inference engine designed for large-scale deployments (many GPUs)
  • cortex will spawn a vLLM subprocess and route requests to it (see the sketch below)
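
For illustration only, a minimal Python sketch of the routing idea, assuming the vLLM subprocess exposes its OpenAI-compatible server on localhost (the port and model name are placeholders, not what cortex actually uses; the real routing lives in cortex's C++ code):

```python
import requests

# Assumed: the spawned vLLM subprocess serves an OpenAI-compatible API on this port.
VLLM_BASE_URL = "http://127.0.0.1:8000"

def route_chat_completion(payload: dict) -> dict:
    """Forward an OpenAI-style chat completion request to the vLLM server unchanged."""
    resp = requests.post(f"{VLLM_BASE_URL}/v1/chat/completions", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

# The client talks to cortex with the usual OpenAI request shape; cortex just relays it.
print(route_chat_completion({
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}))
```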

cortex engines install vllm

  • Download uv to cortexcpp/python_engines/bin/uv if uv is not already installed
  • (via uv) Set up a venv at cortexcpp/python_engines/envs/vllm/<version>/.venv
  • (via uv) Download vllm and its dependencies
  • Known issues:
    • Progress streaming is not supported (the download is done via uv instead of DownloadService).
    • The install is not async, since we need to wait for the subprocess to finish (we may need a new SubprocessService in the future that handles an async WaitProcess()).
    • Hence, stopping and resuming a download also does not work.

Note:

  • All cached Python packages are stored in cortexcpp/python_engines/cache/uv, so that when we remove the python_engines folder, we can be sure nothing is left behind.
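
For reference, a rough Python sketch of the install flow above driven through uv's CLI. The vllm version, the exact uv invocations, and the Linux-style venv path are illustrative assumptions; in cortex this is done from C++ via subprocesses:

```python
import os
import subprocess
from pathlib import Path

PYTHON_ENGINES = Path("cortexcpp/python_engines")
UV_BIN = PYTHON_ENGINES / "bin" / "uv"      # uv is downloaded here if not already installed
VLLM_VERSION = "0.7.3"                       # placeholder version
VENV_DIR = PYTHON_ENGINES / "envs" / "vllm" / VLLM_VERSION / ".venv"

# Keep uv's package cache inside python_engines so deleting that folder leaves nothing behind.
env = {**os.environ, "UV_CACHE_DIR": str(PYTHON_ENGINES / "cache" / "uv")}

# 1. Create the per-version venv.
subprocess.run([str(UV_BIN), "venv", str(VENV_DIR)], env=env, check=True)

# 2. Install vllm and its dependencies into that venv.
subprocess.run(
    [str(UV_BIN), "pip", "install", f"vllm=={VLLM_VERSION}",
     "--python", str(VENV_DIR / "bin" / "python")],
    env=env,
    check=True,
)
```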

cortex models start <model>

  • Spawn vllm serve
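
A hedged Python sketch of the spawn-and-wait step, assuming the per-version venv layout above; the model name, port, and readiness polling are placeholders rather than cortex's actual behavior:

```python
import subprocess
import time
import requests

VLLM_BIN = "cortexcpp/python_engines/envs/vllm/0.7.3/.venv/bin/vllm"  # assumed layout
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"                                   # placeholder model
PORT = 8000

# Spawn `vllm serve`, which starts vLLM's OpenAI-compatible HTTP server for the model.
proc = subprocess.Popen([VLLM_BIN, "serve", MODEL, "--port", str(PORT)])

# Poll the server's /health endpoint until the model has finished loading.
for _ in range(300):  # wait up to ~5 minutes
    try:
        if requests.get(f"http://127.0.0.1:{PORT}/health", timeout=1).status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
```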

TODO:

  • cortex engines install vllm (TODO: async install in a separate thread)
  • Set default engine variant
  • cortex engines load vllm
  • cortex engines list
  • cortex engines uninstall vllm: delete cortexcpp/python_engines/envs/vllm/<version>
  • cortex pull <model>
  • cortex models list
  • cortex models start <model>: spawn vllm serve
  • cortex models stop <model>
  • cortex ps
  • Chat completion
    • Non-streaming
    • Streaming (see the sketch after this list)
  • Embeddings
  • cortex run
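
For the streaming item above, vLLM's OpenAI-compatible server emits server-sent events, which cortex could relay to the client line by line. A minimal Python sketch, with the endpoint, port, and model as assumptions:

```python
import requests

def stream_chat_completion(payload: dict):
    """Yield raw SSE lines ('data: {...}', ending with 'data: [DONE]') from vLLM unchanged."""
    with requests.post(
        "http://127.0.0.1:8000/v1/chat/completions",
        json={**payload, "stream": True},
        stream=True,
        timeout=600,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line:  # skip SSE keep-alive blank lines
                yield line

for line in stream_chat_completion({
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}):
    print(line)
```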

Fixes Issues

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

@gau-nernst gau-nernst moved this from Icebox to In Progress in Menlo Mar 20, 2025
@gau-nernst gau-nernst mentioned this pull request Mar 22, 2025
Development

Successfully merging this pull request may close these issues.

vLLM backend for Cortex